{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding a variable to capture NA\n",
    "\n",
    "In previous lectures we studied how to replace missing values by mean/median imputation or by extracting a random sample of the variable for those instances where data is available, and using those values to replace the missing values. We also discussed that these 2 methods assume that the missing data are missing completely at random (MCAR).\n",
    "\n",
    "So what if the data are not missing completely at random? By using this procedure, we would be missing important, predictive information.\n",
    "\n",
    "How can we prevent that?\n",
    "\n",
    "We can capture the importance of missingness by creating an additional variable indicating whether the data was missing for that observation (1) or not (0). The additional variable is a binary variable: it takes only the values 0 and 1, 0 indicating that a value was present for that observation, and 1 indicating that the value was missing for that observation.\n",
    "\n",
    "\n",
    "### Advantages\n",
    "\n",
    "- Easy to implement\n",
    "- Captures the importance of missingess if there is one\n",
    "\n",
    "### Disadvantages\n",
    "\n",
    "- Expands the feature space\n",
    "\n",
    "This method of imputation will add 1 variable per variable in the dataset with missing values. So if a dataset contains 10 features, and all of them have missing values, we will end up with a dataset with 20 features. The original features where we replaced the missing values by the mean/median (or random sampling), and additional 10 features, indicating for each of the variables, whether the value was missing or not.\n",
    "\n",
    "This may not be a problem in datasets with tens to a few hundreds of variables, but if your original dataset contains thousands of variables, by creating an additional variable to indicate NA, you will end up with very big datasets.\n",
    "\n",
    "In addition, data tends to be missing for the same observation on multiple variables, so it may also be the case, that many of your added variables will be actually similar to each other.\n",
    "\n",
    "\n",
    "### Final note\n",
    "\n",
    "Typically, mean/median imputation is done together with adding a variable to capture those observations where the data was missing (see lecture \"Replacing NA with the median/mean\"), thus covering 2 angles: if the data was missing completely at random, this would be contemplated by the mean imputation, and if it wasn't this would be captured by the additional variable.\n",
    "\n",
    "In addition, both methods are extremely straight forward to implement, and therefore are a top choice in data science competitions. See for example the winning solution of the KDD 2009 cup: \"Winning the KDD Cup Orange Challenge with Ensemble Selection\" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "===============================================================================\n",
    "\n",
    "## Real Life example: \n",
    "\n",
    "### Predicting Survival on the Titanic: understanding society behaviour and beliefs\n",
    "\n",
    "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.\n",
    "\n",
    "\n",
    "### Predicting Sale Price of Houses\n",
    "\n",
    "The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential homes. Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is over or underestimated, before making a buying judgment.\n",
    "\n",
    "=============================================================================\n",
    "\n",
    "In the following cells, I will show the mean/median imputation + creating an additional variable to indicate missingness on the Titanic and House Price datasets from Kaggle.\n",
    "\n",
    "If you haven't downloaded the dataset yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# for regression problems\n",
    "from sklearn.linear_model import LinearRegression, Ridge\n",
    "\n",
    "# for classification\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# to split and standarize the datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# to evaluate regression models\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "# to evaluate classification models\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>7.2500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>38.0</td>\n",
       "      <td>71.2833</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>26.0</td>\n",
       "      <td>7.9250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>35.0</td>\n",
       "      <td>53.1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>8.0500</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Survived   Age     Fare\n",
       "0         0  22.0   7.2500\n",
       "1         1  38.0  71.2833\n",
       "2         1  26.0   7.9250\n",
       "3         1  35.0  53.1000\n",
       "4         0  35.0   8.0500"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load the Titanic Dataset with a few variables for demonstration\n",
    "\n",
    "data = pd.read_csv('titanic.csv', usecols = ['Age', 'Fare','Survived'])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Survived    0.000000\n",
       "Age         0.198653\n",
       "Fare        0.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's look at the percentage of NA\n",
    "data.isnull().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((623, 3), (268, 3))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's separate into training and testing set\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(data, data.Survived, test_size=0.3,\n",
    "                                                    random_state=0)\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Age_NA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>857</th>\n",
       "      <td>1</td>\n",
       "      <td>51.0</td>\n",
       "      <td>26.5500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52</th>\n",
       "      <td>1</td>\n",
       "      <td>49.0</td>\n",
       "      <td>76.7292</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>386</th>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>46.9000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <td>0</td>\n",
       "      <td>54.0</td>\n",
       "      <td>77.2875</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>578</th>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.4583</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Survived   Age     Fare  Age_NA\n",
       "857         1  51.0  26.5500       0\n",
       "52          1  49.0  76.7292       0\n",
       "386         0   1.0  46.9000       0\n",
       "124         0  54.0  77.2875       0\n",
       "578         0   NaN  14.4583       1"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create variable indicating missingness\n",
    "\n",
    "X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)\n",
    "X_test['Age_NA'] = np.where(X_test['Age'].isnull(), 1, 0)\n",
    "\n",
    "X_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(29.915338645418327, 29.0)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we can see that mean and median are similar. So I will replace with the median\n",
    "X_train.Age.mean(), X_train.Age.median(),"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Age_NA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>857</th>\n",
       "      <td>1</td>\n",
       "      <td>51.0</td>\n",
       "      <td>26.5500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52</th>\n",
       "      <td>1</td>\n",
       "      <td>49.0</td>\n",
       "      <td>76.7292</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>386</th>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>46.9000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <td>0</td>\n",
       "      <td>54.0</td>\n",
       "      <td>77.2875</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>578</th>\n",
       "      <td>0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>14.4583</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>549</th>\n",
       "      <td>1</td>\n",
       "      <td>8.0</td>\n",
       "      <td>36.7500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>247.5208</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>157</th>\n",
       "      <td>0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>7.1417</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>653</th>\n",
       "      <td>1</td>\n",
       "      <td>29.0</td>\n",
       "      <td>7.8292</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>235</th>\n",
       "      <td>0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>7.5500</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>785</th>\n",
       "      <td>0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>241</th>\n",
       "      <td>1</td>\n",
       "      <td>29.0</td>\n",
       "      <td>15.5000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351</th>\n",
       "      <td>0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>35.0000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>862</th>\n",
       "      <td>1</td>\n",
       "      <td>48.0</td>\n",
       "      <td>25.9292</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>851</th>\n",
       "      <td>0</td>\n",
       "      <td>74.0</td>\n",
       "      <td>7.7750</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>753</th>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>7.8958</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>532</th>\n",
       "      <td>0</td>\n",
       "      <td>17.0</td>\n",
       "      <td>7.2292</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>485</th>\n",
       "      <td>0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>25.4667</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Survived   Age      Fare  Age_NA\n",
       "857         1  51.0   26.5500       0\n",
       "52          1  49.0   76.7292       0\n",
       "386         0   1.0   46.9000       0\n",
       "124         0  54.0   77.2875       0\n",
       "578         0  29.0   14.4583       1\n",
       "549         1   8.0   36.7500       0\n",
       "118         0  24.0  247.5208       0\n",
       "12          0  20.0    8.0500       0\n",
       "157         0  30.0    8.0500       0\n",
       "127         1  24.0    7.1417       0\n",
       "653         1  29.0    7.8292       1\n",
       "235         0  29.0    7.5500       1\n",
       "785         0  25.0    7.2500       0\n",
       "241         1  29.0   15.5000       1\n",
       "351         0  29.0   35.0000       1\n",
       "862         1  48.0   25.9292       0\n",
       "851         0  74.0    7.7750       0\n",
       "753         0  23.0    7.8958       0\n",
       "532         0  17.0    7.2292       0\n",
       "485         0  29.0   25.4667       1"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's replace the NA with the median value in the training set\n",
    "X_train['Age'].fillna(X_train.Age.median(), inplace=True)\n",
    "X_test['Age'].fillna(X_train.Age.median(), inplace=True)\n",
    "\n",
    "X_train.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "scaler = StandardScaler()\n",
    "X_train = pd.DataFrame(scaler.fit_transform(X_train))\n",
    "X_test = pd.DataFrame(scaler.transform(X_test))\n",
    "\n",
    "X_train.columns = ['Survived','Age','Fare','Age_NA']\n",
    "X_test.columns = ['Survived','Age','Fare','Age_NA']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Logistic Regression roc-auc: 0.6794863451985858\n",
      "Test set\n",
      "Logistic Regression roc-auc: 0.7244940476190476\n",
      "Train set\n",
      "Logistic Regression roc-auc: 0.6762705798138868\n",
      "Test set\n",
      "Logistic Regression roc-auc: 0.7167857142857142\n"
     ]
    }
   ],
   "source": [
    "# we compare the models built using Age filled with median, vs Age filled with median + additional\n",
    "# variable indicating missingness\n",
    "\n",
    "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n",
    "logit.fit(X_train[['Age','Fare']], y_train)\n",
    "print('Train set')\n",
    "pred = logit.predict_proba(X_train[['Age','Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = logit.predict_proba(X_test[['Age','Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n",
    "\n",
    "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n",
    "logit.fit(X_train[['Age','Age_NA', 'Fare']], y_train)\n",
    "print('Train set')\n",
    "pred = logit.predict_proba(X_train[['Age','Age_NA', 'Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = logit.predict_proba(X_test[['Age','Age_NA', 'Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Support Vector Machine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "SVC roc-auc: 0.692978460337086\n",
      "Test set\n",
      "SCV roc-auc: 0.7418154761904762\n",
      "Train set\n",
      "SVC roc-auc: 0.6917203531376761\n",
      "Test set\n",
      "SVC roc-auc: 0.7360714285714286\n"
     ]
    }
   ],
   "source": [
    "# we compare the models built using Age filled with median, vs Age filled with median + additional\n",
    "# variable indicating missingness\n",
    "\n",
    "SVM_model = SVC(random_state=44, probability=True, max_iter=-1, kernel='linear')\n",
    "SVM_model.fit(X_train[['Age', 'Fare']], y_train)\n",
    "print('Train set')\n",
    "pred = SVM_model.predict_proba(X_train[['Age', 'Fare']])\n",
    "print('SVC roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = SVM_model.predict_proba(X_test[['Age', 'Fare']])\n",
    "print('SCV roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n",
    "\n",
    "# model build using natural distributions\n",
    "\n",
    "SVM_model = SVC(random_state=44, probability=True, max_iter=-1, kernel='linear')\n",
    "SVM_model.fit(X_train[['Age','Age_NA', 'Fare']], y_train)\n",
    "print('Train set')\n",
    "pred = SVM_model.predict_proba(X_train[['Age','Age_NA', 'Fare']])\n",
    "print('SVC roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = SVM_model.predict_proba(X_test[['Age','Age_NA', 'Fare']])\n",
    "print('SVC roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the titanic dataset, including a variable to indicate missingness for Age did not show an improvement in the performance of the logistic regression and barely the support vector machine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### House Sale Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# we are going to train a model on the following variables,\n",
    "\n",
    "cols_to_use = ['OverallQual', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea','WoodDeckSF', 'BsmtUnfSF',\n",
    "               'LotFrontage', 'MasVnrArea', 'GarageYrBlt']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1460, 10)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>LotFrontage</th>\n",
       "      <th>OverallQual</th>\n",
       "      <th>MasVnrArea</th>\n",
       "      <th>BsmtUnfSF</th>\n",
       "      <th>TotalBsmtSF</th>\n",
       "      <th>1stFlrSF</th>\n",
       "      <th>GrLivArea</th>\n",
       "      <th>GarageYrBlt</th>\n",
       "      <th>WoodDeckSF</th>\n",
       "      <th>SalePrice</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>65.0</td>\n",
       "      <td>7</td>\n",
       "      <td>196.0</td>\n",
       "      <td>150</td>\n",
       "      <td>856</td>\n",
       "      <td>856</td>\n",
       "      <td>1710</td>\n",
       "      <td>2003.0</td>\n",
       "      <td>0</td>\n",
       "      <td>208500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>80.0</td>\n",
       "      <td>6</td>\n",
       "      <td>0.0</td>\n",
       "      <td>284</td>\n",
       "      <td>1262</td>\n",
       "      <td>1262</td>\n",
       "      <td>1262</td>\n",
       "      <td>1976.0</td>\n",
       "      <td>298</td>\n",
       "      <td>181500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>68.0</td>\n",
       "      <td>7</td>\n",
       "      <td>162.0</td>\n",
       "      <td>434</td>\n",
       "      <td>920</td>\n",
       "      <td>920</td>\n",
       "      <td>1786</td>\n",
       "      <td>2001.0</td>\n",
       "      <td>0</td>\n",
       "      <td>223500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>60.0</td>\n",
       "      <td>7</td>\n",
       "      <td>0.0</td>\n",
       "      <td>540</td>\n",
       "      <td>756</td>\n",
       "      <td>961</td>\n",
       "      <td>1717</td>\n",
       "      <td>1998.0</td>\n",
       "      <td>0</td>\n",
       "      <td>140000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>84.0</td>\n",
       "      <td>8</td>\n",
       "      <td>350.0</td>\n",
       "      <td>490</td>\n",
       "      <td>1145</td>\n",
       "      <td>1145</td>\n",
       "      <td>2198</td>\n",
       "      <td>2000.0</td>\n",
       "      <td>192</td>\n",
       "      <td>250000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   LotFrontage  OverallQual  MasVnrArea  BsmtUnfSF  TotalBsmtSF  1stFlrSF  \\\n",
       "0         65.0            7       196.0        150          856       856   \n",
       "1         80.0            6         0.0        284         1262      1262   \n",
       "2         68.0            7       162.0        434          920       920   \n",
       "3         60.0            7         0.0        540          756       961   \n",
       "4         84.0            8       350.0        490         1145      1145   \n",
       "\n",
       "   GrLivArea  GarageYrBlt  WoodDeckSF  SalePrice  \n",
       "0       1710       2003.0           0     208500  \n",
       "1       1262       1976.0         298     181500  \n",
       "2       1786       2001.0           0     223500  \n",
       "3       1717       1998.0           0     140000  \n",
       "4       2198       2000.0         192     250000  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's load the House Sale Price dataset\n",
    "\n",
    "data = pd.read_csv('houseprice.csv', usecols=cols_to_use+['SalePrice'])\n",
    "print(data.shape)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LotFrontage    0.177397\n",
       "OverallQual    0.000000\n",
       "MasVnrArea     0.005479\n",
       "BsmtUnfSF      0.000000\n",
       "TotalBsmtSF    0.000000\n",
       "1stFlrSF       0.000000\n",
       "GrLivArea      0.000000\n",
       "GarageYrBlt    0.055479\n",
       "WoodDeckSF     0.000000\n",
       "SalePrice      0.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's inspect the columns  with missing values\n",
    "data.isnull().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((1022, 10), (438, 10))"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's separate into training and testing set\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(data, data.SalePrice, test_size=0.3,\n",
    "                                                    random_state=0)\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observed that the numerical variables are not normally distributed. In particular, most of them apart from YearBuilt are skewed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# let's make a function to replace the NA with median or 0s\n",
    "\n",
    "def impute_na(df, variable, median):\n",
    "    df[variable+'_NA'] = np.where(df[variable].isnull(), 1, 0)\n",
    "    df[variable].fillna(median, inplace=True)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LotFrontage      69.0\n",
       "MasVnrArea        0.0\n",
       "GarageYrBlt    1979.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's look at the median of the variables with NA\n",
    "\n",
    "X_train[['LotFrontage', 'MasVnrArea', 'GarageYrBlt']].median()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# let's impute the NA with  the median\n",
    "# remember that we need to impute with the median for the train set, and then propagate to test set\n",
    "\n",
    "median = X_train.LotFrontage.median()\n",
    "impute_na(X_train, 'LotFrontage', median)\n",
    "impute_na(X_test, 'LotFrontage', median)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "median = X_train.MasVnrArea.median()\n",
    "impute_na(X_train, 'MasVnrArea', median)\n",
    "impute_na(X_test, 'MasVnrArea', median)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "median = X_train.GarageYrBlt.median()\n",
    "impute_na(X_train, 'GarageYrBlt', median)\n",
    "impute_na(X_test, 'GarageYrBlt', median)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LotFrontage       0.0\n",
       "OverallQual       0.0\n",
       "MasVnrArea        0.0\n",
       "BsmtUnfSF         0.0\n",
       "TotalBsmtSF       0.0\n",
       "1stFlrSF          0.0\n",
       "GrLivArea         0.0\n",
       "GarageYrBlt       0.0\n",
       "WoodDeckSF        0.0\n",
       "SalePrice         0.0\n",
       "LotFrontage_NA    0.0\n",
       "MasVnrArea_NA     0.0\n",
       "GarageYrBlt_NA    0.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.isnull().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['LotFrontage', 'OverallQual', 'MasVnrArea', 'BsmtUnfSF', 'TotalBsmtSF',\n",
       "       '1stFlrSF', 'GrLivArea', 'GarageYrBlt', 'WoodDeckSF'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create a list with the untransformed columns\n",
    "cols_to_use_no_na = X_train.columns[:-4]\n",
    "cols_to_use_no_na"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['LotFrontage',\n",
       " 'OverallQual',\n",
       " 'MasVnrArea',\n",
       " 'BsmtUnfSF',\n",
       " 'TotalBsmtSF',\n",
       " '1stFlrSF',\n",
       " 'GrLivArea',\n",
       " 'GarageYrBlt',\n",
       " 'WoodDeckSF',\n",
       " 'LotFrontage_NA',\n",
       " 'MasVnrArea_NA',\n",
       " 'GarageYrBlt_NA']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cols_to_use = list(X_train.columns)\n",
    "cols_to_use.remove('SalePrice')\n",
    "cols_to_use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# let's standarise the dataset\n",
    "scaler = StandardScaler()\n",
    "X_train_no_na = scaler.fit_transform(X_train[cols_to_use_no_na])\n",
    "X_test_no_na = scaler.transform(X_test[cols_to_use_no_na])\n",
    "\n",
    "scaler = StandardScaler()\n",
    "X_train_all = scaler.fit_transform(X_train[cols_to_use])\n",
    "X_test_all = scaler.transform(X_test[cols_to_use])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Linear Regression mse: 1161895545.483203\n",
      "Test set\n",
      "Linear Regression mse: 2212393764.7463093\n",
      "\n",
      "Train set\n",
      "Linear Regression mse: 1157194541.9444427\n",
      "Test set\n",
      "Linear Regression mse: 2197999822.6527514\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# we compare the models built using Age filled with median, vs Age filled with median + additional\n",
    "# variable indicating missingness\n",
    "\n",
    "linreg = LinearRegression()\n",
    "linreg.fit(X_train_no_na, y_train)\n",
    "print('Train set')\n",
    "pred = linreg.predict(X_train_no_na)\n",
    "print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))\n",
    "print('Test set')\n",
    "pred = linreg.predict(X_test_no_na)\n",
    "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))\n",
    "print()\n",
    "linreg = LinearRegression()\n",
    "linreg.fit(X_train_all, y_train)\n",
    "print('Train set')\n",
    "pred = linreg.predict(X_train_all)\n",
    "print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))\n",
    "print('Test set')\n",
    "pred = linreg.predict(X_test_all)\n",
    "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))\n",
    "print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14393942"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#  what is the difference in price estimated by the 2 models?\n",
    "\n",
    "2212393764-2197999822"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, when we build a model using the additional variable to capture missingness of data, we observe in the test set that the mse is smaller. This means that the difference between the real value and the estimated value is smaller, and thus our model performs better.\n",
    "\n",
    "There is a difference of ~14 million between the model that replaces with the median and the one that uses median imputation in combination with the additional variables to capture missingness. So even when the difference in mse seems small, when we boil it down to business value, the impact is massive.\n",
    "\n",
    "For a discussion on why the median imputation is not enough in this dataset, refer to lecture \"Replacing NA by mean or median\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Ridge Regression mse: 1162112961.4104362\n",
      "Test set\n",
      "Ridge Regression mse: 2203311969.187996\n",
      "\n",
      "Train set\n",
      "Ridge Regression mse: 1157413427.915226\n",
      "Test set\n",
      "Ridge Regression mse: 2188469437.2202415\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# we compare the models built using Age filled with median, vs Age filled with median + additional\n",
    "# variable indicating missingness\n",
    "\n",
    "#  Ridge, is a regularised linear regression.\n",
    "\n",
    "linreg = Ridge(random_state=30, max_iter=5, tol=100, alpha=10)\n",
    "linreg.fit(X_train_no_na, y_train)\n",
    "print('Train set')\n",
    "pred = linreg.predict(X_train_no_na)\n",
    "print('Ridge Regression mse: {}'.format(mean_squared_error(y_train, pred)))\n",
    "print('Test set')\n",
    "pred = linreg.predict(X_test_no_na)\n",
    "print('Ridge Regression mse: {}'.format(mean_squared_error(y_test, pred)))\n",
    "print()\n",
    "\n",
    "linreg = Ridge(random_state=30, max_iter=5, tol=100, alpha=10)\n",
    "linreg.fit(X_train_all, y_train)\n",
    "print('Train set')\n",
    "pred = linreg.predict(X_train_all)\n",
    "print('Ridge Regression mse: {}'.format(mean_squared_error(y_train, pred)))\n",
    "print('Test set')\n",
    "pred = linreg.predict(X_test_all)\n",
    "print('Ridge Regression mse: {}'.format(mean_squared_error(y_test, pred)))\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe the same conclusion when using regularised linear regression.\n",
    "\n",
    "**To see the power of adding an additional category to indicate missingess, go to lecture 21, \"Creating an additional variable to capture missingness\" on the section \"Engineering missing values for categorical variables\"**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
